Term Generalization and Synonym Resolution for Biological Abstracts: Using the Gene Ontology for Subcellular Localization Prediction
Abstract
The field of molecular biology is growing at an astounding rate, and research findings are being deposited into public databases such as Swiss-Prot. Many of the more than 200,000 protein entries in Swiss-Prot 49.1 lack annotations such as subcellular localization or function, but the vast majority reference journal abstracts describing related research. These abstracts represent a huge amount of information that could be used to generate protein annotations automatically. One way to accomplish this is to train classifiers to perform text categorization on the abstracts. We present a method for improving text classification of biological journal abstracts by generating additional text features from the knowledge represented in a biological concept hierarchy, the Gene Ontology. Our simple technique leverages both the structure of the ontology and the synonyms recorded in it to significantly improve the F-measure of subcellular localization text classifiers, by as much as 0.078, and we achieve F-measures as high as 0.935.
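The general idea of generating extra text features from a concept hierarchy can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny hierarchy, the synonym table, the `GO:` feature prefix, and the function names below are all hypothetical, hand-written stand-ins for data that would in practice be drawn from the Gene Ontology. The sketch resolves a synonym to a canonical term and then adds one feature for that term and one for each of its ancestors.

```python
import re
from collections import Counter

# Hypothetical toy hierarchy (NOT real Gene Ontology data): term -> parent terms.
PARENTS = {
    "nucleolus": ["nucleus"],
    "nucleus": ["organelle"],
    "mitochondrion": ["organelle"],
    "organelle": [],
}

# Hypothetical synonym table: surface form -> canonical term.
SYNONYMS = {
    "mitochondria": "mitochondrion",
    "nuclei": "nucleus",
}


def ancestors(term):
    """Return every ancestor of `term` in the toy hierarchy."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def expand_features(text):
    """Bag-of-words counts plus added features for matched terms and ancestors."""
    tokens = re.findall(r"[a-z]+", text.lower())
    features = Counter(tokens)
    for token in tokens:
        # Synonym resolution: map the surface form to its canonical term.
        canonical = SYNONYMS.get(token, token)
        if canonical in PARENTS:
            features["GO:" + canonical] += 1
            # Term generalization: also count every ancestor concept.
            for ancestor in ancestors(canonical):
                features["GO:" + ancestor] += 1
    return features


if __name__ == "__main__":
    counts = expand_features("The protein localizes to mitochondria and the nucleolus")
    for name in sorted(counts):
        print(name, counts[name])
```

In this sketch, two abstracts that mention different organelles still share the generalized `GO:organelle` feature, and the plural "mitochondria" contributes the same feature as "mitochondrion" would. The resulting counts could then be passed to any text classifier.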